(57) ABSTRACT

A method and apparatus including a plurality of data pro- 
cessing units. A plurality of memory banks having a shared 
address space are coupled to the processors by a crossbar 
coupling to enable reading and writing data between the 
processors and memory banks. A unidirectional network 
couples the memory banks and the processors to enable 
cache coherency messages to be transmitted from the 
memory to the processors. A plurality of semaphore registers
are implemented within the shared address space of the 
memory banks wherein the semaphore registers are acces- 
sible by the processors through the crossbar coupling. 
SYSTEM AND METHOD FOR SEMAPHORE
AND ATOMIC OPERATION MANAGEMENT 
IN A MULTIPROCESSOR 



BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates, in general, to microproces- 
sor systems, and, more particularly, to software, systems and 
methods for implementing atomic operations in a multipro- 
cessor computer system. 

2. Relevant Background 

Microprocessors manipulate data according to instructions
specified by a computer program. The instructions and
data in a conventional system are stored in memory which 
is coupled to the processor by a memory bus. Computer 
programs are increasingly compiled to take advantage of 
parallelism. Parallelism enables a complex program to be
executed as a plurality of similar or disjoint tasks that are 
concurrently executed to improve performance. 

Traditionally, microprocessors were designed to handle a 
single stream of instructions in an environment where the 
microprocessor had full control over the memory address 
space. Multiprocessor computer systems were developed to 
improve program execution by providing a plurality of data 
processors operating in parallel. Early multiprocessor sys- 
tems used special-purpose processors that included features 
specifically designed to coordinate the activities of the 
plurality of processors. Moreover, software was often
specifically compiled to a particular multiprocessor platform.
These factors made multiprocessing expensive to obtain and 
maintain. 

The increasing availability of low-cost high performance 
microprocessors makes general purpose multiprocessing 
computers feasible. As used herein the terms "microprocessor"
and "processor" include complex instruction set computers
(CISC), reduced instruction set computers (RISC)
and hybrids. However, general purpose microprocessors are
not typically designed specifically for large scale multipro- 
cessing. Some microprocessors support configurations of up 
to four processors in a system on a shared bus. To go beyond 
these limits, special purpose hardware, firmware, and soft- 
ware must be employed to coordinate the activities of the 
various microprocessors in a system. 

Interprocess communication and synchronization are two
of the more difficult coordination problems faced by
multiprocessor system designers. Essentially, the problems
surround coordinating the activities of each processor by
exchanging state information between related processes
running on different, and quite often autonomous, processors.
Inability to coordinate processor activities is a primary
limitation in the scalability of multiprocessor designs.
Solutions to this problem become quite complex as the
number of processors increases.

State information is often embodied in a data structure 
called a "semaphore" and can be stored in a shared memory 
resource or semaphore register. A semaphore is essentially a 
flag or set of flags comprising values that indicate the status
of a common (i.e., shared) resource. For example, a set of 
semaphores may be used to assert a lock over a particular 
shared resource. It is desirable to make semaphores available 
to all processors in a multiprocessor system. 

Semaphores are accessed by, modified by, and communicated
with various processes on an ongoing basis. Semaphore
manipulation typically involves a small set of relatively
simple operations such as test, set, test and set, write,
clear, and fetch. These operations are sometimes performed
in combination with some primary mathematical or logical
operation (e.g., increment, decrement, AND, OR). When
semaphores are memory resident, access to the semaphores 
is accomplished in a manner akin to memory operations 
(e.g., read/write or load/store operations) in that the sema- 
phore data is read, updated, and written back to the sema- 
phore register structure. This process is often referred to as 
a "read-modify-write" cycle. 
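The hazard in an unprotected read-modify-write cycle can be sketched as follows. This is an illustrative Python model only: the SharedMemory class and both helper functions are hypothetical names introduced here, not part of the disclosure, and the interleaving of the two processors is written out by hand rather than produced by real concurrency.

```python
# Hypothetical model: one memory-resident semaphore updated by two
# processors using separate read, modify, and write steps.

class SharedMemory:
    """Holds a single semaphore value (assumed name, for illustration)."""

    def __init__(self, value=0):
        self.value = value

    def read(self):
        return self.value

    def write(self, value):
        self.value = value


def interleaved_increment(mem):
    """Two increments whose read and write steps overlap."""
    a = mem.read()      # processor A reads the semaphore
    b = mem.read()      # processor B reads before A has written back
    mem.write(a + 1)    # A writes its updated value
    mem.write(b + 1)    # B overwrites it; A's update is lost
    return mem.read()


def serialized_increment(mem):
    """Two increments, each cycle completing before the next begins."""
    for _ in range(2):
        mem.write(mem.read() + 1)
    return mem.read()


lost = interleaved_increment(SharedMemory(0))
kept = serialized_increment(SharedMemory(0))
print(lost, kept)  # 1 2
```

The first result shows one increment being lost when the cycles overlap, which is why the cycle as a whole has to be atomic.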

These semaphore management operations typically 
involve transferring the semaphore to a processor's cache/ 
internal register, updating the semaphore value, and trans- 
ferring the updated semaphore back to the semaphore reg- 
ister structure. The semaphore manipulations must be 
atomic operations in that no processor can be allowed to 
manipulate (i.e., change) the semaphore value while a sema- 
phore management operation is pending or in flight (e.g., 
when the semaphore is being manipulated by another 
processor). Accordingly, memory-mapped semaphore 
manipulations imply a bus lock or other locking mechanism 
during a typical read-modify-write cycle to ensure atomicity.
Bus locking, however, may not be possible unless all pro- 
cessors share a common bus, and significantly impacts 
performance and scalability of the multiprocessor design. 
Moreover, some mechanisms for ensuring atomic operations 
rely on special instructions in the microprocessor instruction
set architecture (ISA). Such a requirement greatly limits the 
flexibility in processor selection. A need exists for a method 
and system for manipulating memory-mapped semaphore 
registers that does not suffer the locking penalties associated 
with conventional atomic memory operations. 

Atomicity can be ensured by making the semaphore 
cacheable and using cache coherency mechanisms such as 
the MESI protocol to enforce atomicity. Alternatively the 
semaphore can be made uncacheable so that it exists only in 
the shared memory space, with processor bus lock mechanisms
used to prevent all processor communication until the
semaphore management operation completes. In either case, 
when a semaphore is being concurrently shared by a large 
processor count there are performance and implementation 
issues. 

Using cached semaphores requires one processor to
modify the semaphore and then propagate the modification
to all other caches having copies of the semaphore. To
migrate a cache line with write access from one processor to
another quite often involves multiple memory read
transactions along with one or more cache coherency operations
and their accompanying replies. The latency of acquiring
exclusive access to a cache line is a function of the number
of processors that currently share access to the line. Because
of this, cache coherency mechanisms such as the
MESI protocol do not scale well. Given that it is desirable
to configure memory as cacheable (specifically, using a write
allocate cache policy), and that the cache coherency protocol
is designed to support upwards of 40 processors, inevitably
there will be parallel applications where large processor
counts will be using shared memory locations to synchronize
program flow.

Host bus locking ensures atomicity in a very brute force 
manner. The atomic operation support in the IA32 instruc- 
tion set with uncached memory requires two bus operations: 
a read, followed by a write. While these operations proceed 
a bus lock is asserted which prevents other processors from 
gaining access to and utilizing the unused bus bandwidth. 
This is particularly detrimental in computer systems where 
multiple processors and other components share the host bus 
potentially creating conditions for system deadlock. Asserting
bus lock by any agent using the host bus will prevent the
other processors from being able to start or complete any bus
transaction targeting memory.

Similar issues exist for any atomic memory operation. An
atomic memory operation is one in which a read or write
operation is made to a shared memory location. Even when
the shared memory location is uncached, the atomic memory
operation must be completed in a manner that ensures that
any processors that are accessing the shared memory location
are prevented from reading the location until the atomic
operation is completed.

More complex multiprocessor architectures combine multiple
processor boards where each processor board contains
multiple processors coupled together with a shared front side
bus. In such systems, the multiple boards are interconnected
with each other and with memory using an interconnect
network that is independent of the front side bus. In essence,
each of the multiprocessing boards has an independent front
side bus. Because the front side bus is not shared by all of
the system processors, coherency mechanisms such as bus
locking and bus snooping, which operate only on the front
side bus, are difficult if not impossible to implement.

Hence, semaphore management operations consume bus
bandwidth that is merely overhead. Accordingly, it is desirable
to provide a semaphore management mechanism and
method that operates efficiently to minimize overhead. More
specifically, a means for providing semaphore management
that does not rely on either cache coherency mechanisms or
bus locking mechanisms is needed.

SUMMARY OF THE INVENTION

Briefly stated, the present invention involves a method
and apparatus for implementing semaphores in a multiprocessor
including a plurality of data processing units. A
plurality of memory banks having a shared address space are
coupled to the processors to enable reading and writing data
between the processors and memory banks. A plurality of
semaphore registers are implemented within the shared
address space of the memory banks wherein the semaphore
registers are accessible by the processors using memory
operations directed at the portion of the shared address space
allocated to the semaphore registers.

In another aspect the present invention involves a method
of operating a multiprocessor computing system in which a
plurality of processors generate memory requests. A plurality
of memory banks are provided that have a shared
address space and are responsive to memory requests to read
and write data. A crossbar network couples the plurality of
processors with the plurality of memory banks. A portion of
the shared address space in each memory bank is dedicated
to semaphore registers. The dedicated portion of memory is
designated as uncacheable. A plurality of processes are
executed on one or more of the plurality of processors. At
runtime, a portion of the shared address space is allocated to
the plurality of processes. Preferably, at least one of the
physical semaphore registers in a particular memory bank is
mapped into the common address space allocated to the
plurality of processes in the particular memory bank.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a multiprocessor computer environment in
which the present invention is implemented;

FIG. 2 shows portions of an exemplary multiprocessor in
accordance with the present invention;

FIG. 3 illustrates cache coherency mechanisms associated
with a memory bank in accordance with the present invention;

FIG. 4 shows a memory mapping diagram illustrating
operation of a semaphore mechanism in accordance with the
present invention; and

FIG. 5 and FIG. 6 illustrate an exemplary addressing
format used to read and write the contents of semaphores.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

In general, the present invention involves the allocation of
a small, fixed range of shared memory as "semaphore
registers". Semaphore registers are data structures that hold
state information such as flags, counters and the like.
Semaphore manipulation typically involves very "simple"
operations that include test, write, set, clear, test and set,
and fetch with some primary operation (e.g., increment,
decrement, AND, OR, and the like). Semaphores are often
used to share information and/or resources amongst a
plurality of software processes. Semaphore registers represent
a type of uncached memory structure.

An important feature of the present invention is to provide
a scheme by which atomic operations can be performed on
memory mapped registers using conventional "read" and
"write" memory references that are supported by virtually
all microprocessors. However, the present invention alleviates
the need for a read/modify/write cycle in manipulating
semaphores.

In accordance with the present invention, the semaphore
registers reside on the memory banks and are allocated a
portion of the shared address space so all processors in the
multiprocessor system have access to them with substantially
uniform latency. Also, because shared memory is used,
existing processor-to-memory communication networks can
be used without need for a special-purpose network dedicated
to managing semaphore traffic. Also, semaphore
manipulations are accomplished by fundamental memory
operations such as read and write operations such that
virtually any microprocessor and instruction set architecture
can be used.

The present invention is illustrated and described in terms
of a general-purpose multiprocessing computing system
comprising a number of substantially identical microprocessors
having integrated cache memory. Although this type of
computing system is a good tool for illustrating the features
and principles of the present invention, it should be understood
that a heterogeneous set of processors may be used.
Some processors may include integrated cache, some processors
may include external cache, and yet other processors
may [...] illustrated in
[...] equal
application in partitioned memory systems as well.
Accordingly, the specific examples given herein are supplied
for purposes of illustration and understanding and are not to
be viewed as limitations of the invention except where
expressly stated. Moreover, an important feature of the
present invention is that it is readily scaled upwardly and
downwardly to meet the needs of a particular application.
Accordingly, unless specified to the contrary the present
invention is applicable to significantly larger, more complex
network environments as well as small network environments
such as conventional local area network (LAN)
systems.

FIG. 1 shows a multiprocessor computer environment in
which the present invention is implemented. Multiprocessor
computer system 100 incorporates N processor boards 101.
Each processor board 101 is logically referred to as a
processor node 101. Each processor board 101 comprises
one or more microprocessors, such as processors P1 and P2,
having integrated cache memory in the particular examples.
Processor boards 101 may be configured in groups sharing
a common front side bus (FSB) 104 and sharing a common
gateway through a bridge 107 to host bus network 102. An
exemplary processor is the Pentium® III Xeon™ processor
manufactured by Intel Corporation which can be configured
as single processors and symmetric multiprocessors (SMP)
of up to four processors. Clustered designs of multiple SMP
systems are also available.

Processors 101 are bidirectionally coupled to shared
memory 103 through interconnect network 102. Interconnect
network 102 preferably implements a full crossbar
connection enabling any processor board 101 to access any
memory location implemented in any memory bank 105.
Shared memory 103 is configured as a plurality M of
memory banks 105. Each memory bank 105 may itself
comprise a group of memory components. Preferably shared
memory 103 is organized as a plurality of "lines" where each
line is sized based on the architecturally defined line size of
cache within processors 101. A line in memory or cache is
the smallest accessible unit of data although the present
invention supports memory architectures that permit
addressing within a line.

Each processor board 101 may include a front side bus
(FSB) gateway interface 106 that enables access to local
memory 108 and peripheral component interconnect (PCI)
bridge 110. In the particular examples local memory 108 is
not included in the address space of shared memory 103 and
is shared only amongst processors P1 and P2 coupled to the
same front side bus 104 as the FSB crossbar 106. PCI bridge
110 supports conventional PCI devices to access and
manage, for example, connections to external network 111
and/or storage 112. It is contemplated that some processor
boards 101 may eliminate the PCI bridge functionality
where PCI devices are available through other boards 101.

Significantly, the front side bus 104 of each processor
board 101 is independent of the front side bus 104 of all
other processor boards 101. Hence, any mechanisms provided
by, for example, the IA32 instruction set to perform
atomic operations will not work as between processors
located on different boards 101.

Memory operations are conducted when a processor P1 or
P2 executes an instruction that requires a load from or store
to a target location in memory 103. In executing a memory
operation, the processor first determines whether the target
memory location is represented, valid and accessible in a
cache. The cache may be onboard the processor executing
the memory operation or may be in an external cache
memory. In case of a cache miss, the memory operation is
handled by bridge 107. Bridge 107 generates an access
request to host bus network 102 specifying the target location
address, operation type (e.g., read/write), as well as
other control information that may be required in a particular
implementation. Shared memory 103 receives the request
and accesses the specified memory location. In the case of
a read operation the requested data is returned via a response
passed through host bus network 102 and addressed to the
bridge 107 that generated the access request. A write
transaction may return an acknowledgement that the write
occurred. In the event an error occurs within shared memory
103 the response to bridge 107 may include a condition code
indicating information about the error.

FIG. 2 illustrates a specific implementation and interconnect
strategy supporting implementations of the present
invention. In the implementation of FIG. 2 there are sixteen
segments labeled SEGMENT_0 through SEGMENT_15.
Each segment includes a processor group 201. A processor
group 201 in a particular example includes thirty two
processors, each coupled to processor switch 202 through a
bi-directional data and command interface. Processor switch
202 includes an output to a trunk line 214 for each memory
bank group 205. Similarly, each memory switch 203
includes an output to a trunk line 214 for each processor
group 201. In this manner, any processor group can be
selectively coupled to any memory bank group through
appropriate configuration of processor switch 202 and
memory switch 203.

FIG. 3 shows important semaphore management mechanisms
associated with a memory bank 205 in accordance
with the present invention. Memory switches 203 communicate
with trunk lines 214 (shown in FIG. 2) to send and
receive memory access requests to memory controller 301.
Upon receiving a memory access request, memory switch
203 passes information including the target memory address
and processor node identification, as well as control and
mode information to memory controller 301. The target
memory address refers to a location in memory bank data
portion 302 or a portion of the memory address space that
has been allocated to semaphore controller 302.

The processor ID is a value indicating a unique processor
in a multiprocessor system that is conducting the
[...] operation. In a particular embodiment this information
is passed between switch 203 and memory controller
[...] memory bank 301 as a data packet having
defined fields for the various types of information. The
specific layout of this data packet is chosen to meet the needs
of a particular implementation.

Most of the shared address space is allocated for data and
instructions in memory 302. Memory 302 is organized as a
plurality of memory lines 312, also called cache lines. In a
particular example each memory line 312 is 256 bits wide
and memory 302 includes a variable number of lines
depending on the amount of physical memory implemented.
Memory 302 is allocated to executing processes using
available memory management and allocation mechanisms
in a substantially conventional manner. Typically, a group of
executing processes will share a common address space that
is allocated to those processes at runtime. Typically all or a
significant portion of the conventional memory area 302 is
designated as cacheable memory.

Cache coherency unit 305 operates in conjunction with
cache directory 304 to manage cache coherency across the
multiple processors that may have cached copies of cacheable
memory locations. Each entry 314 corresponds to a
memory line within memory 302. Cache coherency chip 301
may be implemented as a custom integrated circuit such as
an ASIC, a one time or re-programmable logic device such
as a programmable gate array, or as discrete components
coupled in a conventional circuit board or multi-chip module.
Cache coherency chip 301 uses the memory address to
access cache coherency directory 304. Cache coherency
directory 304 includes a multi-bit entry 314 for each
memory line in the shared memory address space of the
particular memory bank data portion 320. Cache directory
314 includes a plurality of entries 314, each 36 bits wide in
a particular example. Each entry 314 contains a value
indicating the current state of the corresponding memory
line.

In accordance with the present invention, a portion of the
shared address space of each memory bank is allocated to
hardware semaphores, hereinafter referred to as the "hardware
semaphore portion". References to the hardware semaphore
portion are sent to semaphore controller 302 rather
than conventional memory portion 302. The hardware semaphore
portion of the address space is designated as uncacheable.
In a particular example, the size of hardware semaphore
portion is selected to allocate a fixed address space of
about 4K byte to each physical processor in the system.
Hence, a system with 320 processors will allocate a total of
1.25 MB, spread amongst the memory banks 205, to hardware
semaphore controller 303. In an exemplary system the
total address space available is in the order of 64 GB or
more. Hence, the portion allocated to hardware semaphores
is relatively small.

Normal memory read/write operations address locations
within conventional memory portion 302 as the executing
processes cannot be assigned address space within the
hardware semaphore portion by the virtual memory management
system. The present invention introduces a new
system call into the operating system code to map one or
more of the physical semaphore registers within semaphore
controller 302 into the process' common address space. This
enables the processes to read and write data with the
hardware semaphore registers by conventional memory
operations.

For any multiprocessor system having a number "n"
physical processors, there may be "n" processes executing in
addition to an OS process executing at any given time. Each
process should have its own semaphore register, hence the
system should support n+1 semaphore registers.

Atomicity of semaphore operations is important to ensure
that any operation that manipulates a semaphore value is
completed before another operation that reads or manipulates
the semaphore can take place. In the preferred
implementation, serialization of memory operations is under
the control of bridge controller 107 shown in FIG. 1. Bridge
controller 107 includes mechanisms referred to as "fence
operations" that impose order on memory operations that
affect uncached address space. A programmer uses the fence
operations to ensure correctness. These mechanisms ensure
that uncached memory references are completed before
allowing any cached read/write or uncached read operations
to proceed. Uncached memory references are memory
operations that specify an uncached area of the address
space, including the hardware semaphore portion of the
address space. These mechanisms operate in a similar manner
in conjunction with the present invention to ensure that
semaphore manipulations, which appear to bridge controller
107 as uncached memory operations, are serialized. This
implicitly guarantees that all references to the uncached
hardware semaphore area 303 will be serialized. This functionality
is akin to the prior methods of stalling the memory
bus during a semaphore write operation. However, the
negative impacts are significantly curtailed by stalling these
memory transactions in the manner described herein. It is
contemplated that semaphore modification operations will
take no more than six clock cycles to complete as compared
to the upwards of hundreds of clock cycles previous bus
stalling techniques incurred.

FIG. 4 shows a conceptual diagram illustrating an exemplary
layout of the semaphore registers within the context of
the entire memory address space. The linear address space
401 represents the common block of address space
assigned to a given set of independent or common executing
processes. Physical address space 402 represents the available
physical memory in which the memory portions 302
and 303 (shown in FIG. 3) are physically implemented. As
shown in FIG. 4, a number of physical memory lines are
allocated to hardware semaphore registers 303. In the particular
example, each memory line is 64 bits wide so that in
normal operation memory reads and writes are performed in
8-byte wide groups of data.

The hardware semaphore memory area 303 holds a cluster
of hardware semaphore registers. The cluster of registers is
preferably mapped to a common linear address space shared
by a plurality of processes. It should be noted that the
memory management system and/or microprocessor architecture
may impose practical limits on the size and organization
of semaphore registers. In the particular examples,
semaphore clusters are allocated on 4 KB boundaries
because the virtual memory (VM) management of the processor
provides for multiprocessing protection mechanisms
at that granularity. Managing the semaphore register
involves allocating or assigning a particular hardware
semaphore register within controller 304 to a particular
[...] system so as to avoid assignment of a
[...] to unrelated processes. Management at a
[...] over, for example, allocating
individual or small groups of hardware semaphore
registers on a register-by-register basis.

As illustrated in the exploded portion of FIG. 4, each
memory line can hold either one (1) 64-bit or two (2) 32-bit
semaphore registers 403. In the
particular examples, each semaphore register 403 in the
exemplary implementation is 32 bits wide. The meaning and
use of a semaphore is at the discretion of the
application.

FIG. 5 and FIG. 6 illustrate an exemplary addressing
format used to read and write the contents of semaphores
303. These examples assume a 32-bit virtual address (shown
in FIG. 5) and a 36-bit physical address (shown in FIG. 6).
In both cases, bits [0:2] are byte offset bits provided to
determine whether the memory is being referenced as a 64
or 32 bit operation, bits [3:4] are operation code specifiers
for the register, and bits [5:11] indicate a particular semaphore
within a cluster. The remaining bits [12:31] of the
virtual address indicate the virtual base of the semaphore
cluster of interest. In the physical address format shown in
FIG. 6, bits [12:19] indicate the cluster number to identify a
particular cluster within the plurality of clusters shown in
FIG. 4. The physical base of the semaphores is indicated by
bits [20:35] as shown in FIG. 6.

To access any specific 32-bit hardware semaphore within
a cluster the address is calculated by combining the virtual
cluster base address with the semaphore number and the
read/write operation code. The operation code is encoded
into the word select (WS) bits as indicated in Table 1.

TABLE 1

The following briefly summarizes the read operations
described in Table 1:

Test/ShrRead (WS=00b)

Read and return contents of requested semaphore register.

opdef: SHR_TEST32[ShrReg]
SHR_TEST64[ShrReg]
Synonym: SHR_READ32[ShrReg]
SHR_READ64[ShrReg]

Test&Set (WS=01b)

Read and return bit 0 (2^0) of requested semaphore register.
Set semaphore register bit 2^0 to a nonzero value (i.e., 1).

opdef: SHR_TSET32[ShrReg]
SHR_TSET64[ShrReg]

Fetch&Increment (WS=10b)

Read and return contents of requested semaphore register.
32 or 64-bit signed increment of requested semaphore
register after read.

opdef: SHR_INC32[ShrReg]
SHR_INC64[ShrReg]

Fetch&Decrement (WS=11b)

Read and return contents of requested semaphore register.
32 or 64-bit signed decrement of requested semaphore
register after read.

opdef: SHR_DEC32[ShrReg]
SHR_DEC64[ShrReg]

The following briefly describes the write operations
shown in Table 1:

ShrWrite (WS=00b)

Store 32- or 64-bit data from write packet into requested
semaphore register.

opdef: SHR_WRITE32[ShrReg]
SHR_WRITE64[ShrReg]

NOTE: Setting a semaphore register is accomplished by
writing any data value (register/immediate) with bit 2^0 = 1.

Clear (WS=01b)

Zero contents of requested 32- or 64-bit semaphore register.

opdef: SHR_CLR32[ShrReg]
SHR_CLR64[ShrReg]

AND (WS=10b)

AND 32- or 64-bit data from write packet with requested
semaphore register.
Semaphore_register = Semaphore_register AND
Write_Packet_data.

OR (WS=11b)

OR 32- or 64-bit data from write packet with requested
semaphore register.
Semaphore_register = Semaphore_register OR
Write_Packet_data.
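
The read and write operations summarized above can be modeled in software as a rough sketch. The Python class below is hypothetical and is not the hardware implementation: the name SemaphoreRegister, the 32-bit width, and the wraparound arithmetic are assumptions made for illustration. The only point is that each operation finishes inside a single call, mirroring a single memory reference.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit register; wraparound is an assumption


class SemaphoreRegister:
    """Illustrative model of one semaphore register (hypothetical name)."""

    def __init__(self, value=0):
        self.value = value & MASK32

    def read(self, ws):
        """Apply the read-side operation selected by the WS bits."""
        old = self.value
        if ws == 0b00:                       # Test/ShrRead
            return old
        if ws == 0b01:                       # Test&Set: return bit 0, set it
            self.value = old | 1
            return old & 1
        if ws == 0b10:                       # Fetch&Increment
            self.value = (old + 1) & MASK32
            return old
        if ws == 0b11:                       # Fetch&Decrement
            self.value = (old - 1) & MASK32
            return old
        raise ValueError("WS must be a 2-bit value")

    def write(self, ws, data=0):
        """Apply the write-side operation selected by the WS bits."""
        data &= MASK32
        if ws == 0b00:                       # ShrWrite
            self.value = data
        elif ws == 0b01:                     # Clear (data ignored)
            self.value = 0
        elif ws == 0b10:                     # AND
            self.value &= data
        elif ws == 0b11:                     # OR
            self.value |= data
        else:
            raise ValueError("WS must be a 2-bit value")


reg = SemaphoreRegister()
first = reg.read(0b01)    # lock was free
second = reg.read(0b01)   # lock is now held
reg.write(0b01)           # release by clearing
ticket = reg.read(0b10)   # fetch, then increment
print(first, second, ticket, reg.value)  # 0 1 0 1
```

In this model a caller that receives 0 from Test&Set has acquired the lock, and any later caller receives 1 until the register is cleared.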

Table 2 sets out examples of memory references and
corresponding semaphore operations using the Intel Architecture
32 (IA32) instruction set. In Table 2, "%edi" points to the
base of the currently assigned cluster, which is a 4 KB semaphore
region:

TABLE 2

Processor Operation
(i.e. memory reference)         Operation

movl 0(%edi), %eax              SHR_TEST32[0] (ShrRead)
movl 8(%edi), %eax              SHR_TSET32[0] test & set (bit set to 1)
movl 16(%edi), %eax             SHR_INC32[0] fetch and increment
movl 24(%edi), %eax             SHR_DEC32[0] fetch and decrement
movl %eax, 0(%edi)              Write contents of %eax (ShrWrite)
movl %eax, 8(%edi)              SHR_CLR32[0] clear (%eax ignored)
movl $0x55555555, 16(%edi)      Clear all odd bits in semaphore register 0
movl $0xAAAAAAAA, 24(%edi)      Set all odd bits in semaphore register 0
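
The offsets 0, 8, 16 and 24 used in Table 2 follow from the address format described for FIG. 5: bits [0:2] are the byte offset, bits [3:4] carry the WS operation code, and bits [5:11] select the semaphore within a 4 KB cluster. A minimal Python sketch of that calculation is given below. The function names are hypothetical, and how the byte offset bits are set for a given access width is not modeled; the parameter simply defaults to 0.

```python
BYTE_OFFSET_BITS = 3   # bits [0:2]
WS_BITS = 2            # bits [3:4], word select / operation code
SEM_BITS = 7           # bits [5:11], semaphore number within a cluster
CLUSTER_SIZE = 1 << (BYTE_OFFSET_BITS + WS_BITS + SEM_BITS)  # 4096 bytes


def semaphore_offset(sem_num, ws, byte_offset=0):
    """Offset within a cluster for a semaphore number and WS code."""
    if not 0 <= sem_num < (1 << SEM_BITS):
        raise ValueError("semaphore number out of range")
    if not 0 <= ws < (1 << WS_BITS):
        raise ValueError("WS must be a 2-bit value")
    if not 0 <= byte_offset < (1 << BYTE_OFFSET_BITS):
        raise ValueError("byte offset out of range")
    return (sem_num << 5) | (ws << 3) | byte_offset


def semaphore_address(cluster_base, sem_num, ws, byte_offset=0):
    """Combine a 4 KB aligned cluster base with the in-cluster offset."""
    if cluster_base % CLUSTER_SIZE:
        raise ValueError("cluster base must be 4 KB aligned")
    return cluster_base | semaphore_offset(sem_num, ws, byte_offset)


# Offsets for semaphore 0 under each WS code, as used in Table 2.
offsets = [semaphore_offset(0, ws) for ws in (0b00, 0b01, 0b10, 0b11)]
print(offsets)  # [0, 8, 16, 24]
```

With seven semaphore-number bits and 32 bytes of address space per semaphore, a cluster spans exactly 4 KB, which matches the 4 KB allocation boundary used for clusters.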
Atomic operations can be completed in one memory 
reference without ever asserting a bus lock. Hence, the 
hardware semaphore implementation in accordance with the 
present invention has approximately half the memory traffic 
of conventional uncached atomic operations and potentially 
greater reductions in memory/coherency traffic for cached 
semaphores. Any semaphore reference is completed without 
ever asserting a memory bus lock, thus allowing other bus 
agents access to memory resources. This implementation 
alleviates the need for a third network. Moreover, the present 
invention uses existing memory management capabilities to 
map multiple processors to one memory space (multiple 
physical processors accessing a common cluster). Any 
atomic operation to a specific semaphore register from one 
or more concurrently referencing processors is completed in 
one memory reference. No hardware deadlock conditions 
are likely, which eliminates the need for costly and complex 
logic to detect a deadlock situation between two or more 
processors. 
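
The "approximately half the memory traffic" comparison can be illustrated with a simple reference count. The sketch below is an assumption-laden model, not a measurement: bus_model_t and both functions are hypothetical, and the conventional uncached test-and-set is assumed to consist of one locked read followed by one write, whereas the semaphore register version is a single load whose update is performed at the memory side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one memory location plus a count of the
 * memory references issued by the processor. */
typedef struct {
    uint32_t value;
    unsigned refs;
} bus_model_t;

/* Conventional uncached test-and-set (assumed): a read under bus
 * lock followed by a write, i.e. two memory references. */
static uint32_t conventional_tset(bus_model_t *m)
{
    assert(m != NULL);
    uint32_t old = m->value;
    m->refs++;               /* locked read             */
    m->value = old | 1u;
    m->refs++;               /* write, then lock release */
    return old;
}

/* Semaphore register test-and-set: the processor issues one load
 * and the register is updated at the memory side, with no bus lock. */
static uint32_t semaphore_tset(bus_model_t *m)
{
    assert(m != NULL);
    uint32_t old = m->value;
    m->value = old | 1u;
    m->refs++;               /* the single memory reference */
    return old;
}
```

Under these assumptions, each acquisition attempt costs two references in the conventional case and one with a semaphore register, which is consistent with the halving of traffic stated above.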

Although the invention has been described and illustrated 
with a certain degree of particularity, it is understood that the 
present disclosure has been made only by way of example, 
and that numerous changes in the combination and arrange- 
ment of parts can be resorted to by those skilled in the art 
without departing from the spirit and scope of the invention, 
as hereinafter claimed. 

We claim: 

1. A multiprocessor data processing system comprising: 
a plurality of microprocessors; 

a plurality of memory banks having a shared address 
space; 

a network coupling the memory banks and the 
microprocessors to enable memory operation messages to be 
communicated between the memory and the 
microprocessors; 

a first portion of the shared address space allocated to 
conventional memory operations; 

a plurality of semaphore registers implemented within a 
second portion of the shared address space of the 
memory banks, wherein the semaphore registers are 
accessible by the microprocessors through the network; 
and 

a bridge controller coupled to the memory banks and 
operable to prevent any cached read/write or uncached 
read operation to proceed until all semaphore write 
operations have completed. 

2. The system of claim 1 wherein the semaphore registers 
are implemented in a fixed range of the memory address 
space allocated to each of the memory banks. 

3. The system of claim 1 wherein the semaphore registers 
are assigned at runtime to specific software processes. 

4. The system of claim 1 wherein the portion of the shared 
address space in which the semaphore registers are imple- 
mented is uncacheable. 

5. The system of claim 1 wherein the semaphore registers 
support atomic operations including test, set, test&set, clear, 
signed increment, signed decrement, and shared read/write. 

6. The system of claim 5 wherein the atomic operations 
are encoded into an address specifying the semaphore using 
a read or write memory operation natively supported by the 
microprocessors. 

7. A method of communicating state information in a 
multiprocessor computing system comprising: 

providing a plurality of microprocessors generating 
memory requests, each memory request specifying an 
address within a shared space; 

allocating a portion of the shared address space in each 
memory bank to semaphore registers; and 
accessing the state information by any of the plurality of 
microprocessors using memory operations specifying a 
target address within the portion of the shared address 
space allocated to the semaphore registers; 

specifying a virtual base portion in the memory request 
containing a value indicating a base address in which a 
cluster of semaphore registers resides; 

specifying a semaphore identification portion in the 
memory request containing a value indicating a 
particular semaphore register within the cluster of 
semaphore registers; and 

specifying a value indicating a particular operation to be 
performed on the semaphore specified by the virtual 
base portion and the semaphore identification portion. 

8. The method of claim 7 further comprising a step of 
designating the portion of the shared address space allocated 
to semaphore registers as uncacheable. 

9. The method of claim 7 further comprising: 

executing a plurality of software processes on one or more 
of the plurality of microprocessors; 

at runtime, allocating a portion of the shared address 
space to the plurality of processes; and 

mapping at least one of the physical semaphore register 
sets into the common address space allocated to the 
plurality of processes. 

10. The method of claim 7 wherein the step of accessing 
the state information comprises reading and returning 
contents of the specified semaphore register. 

11. The method of claim 7 wherein the step of accessing 
the state information comprises reading and returning a 
specified bit within the specified semaphore register 
followed by setting the specified bit to a nonzero value. 

12. The method of claim 7 wherein the step of accessing 
the state information comprises reading and returning 
contents of requested semaphore register followed by 
incrementing the requested semaphore register. 

13. The method of claim 7 wherein the step of accessing 
the state information comprises reading and returning 
contents of requested semaphore register followed by 
decrementing the value of the specified semaphore register. 

14. The method of claim 7 wherein the step of accessing 
the state information comprises storing data specified in the 
memory request into requested semaphore register. 

15. The method of claim 7 wherein the step of accessing 
the state information comprises setting the contents of the 
semaphore register to zero. 

16. The method of claim 7 wherein the step of accessing 
the state information comprises performing a logical AND 
operation between data specified in the request and a value 
stored in the specified semaphore register. 

17. The method of claim 7 wherein the step of accessing 
the state information comprises performing a logical OR 
operation between data specified in the request and a value 
stored in the specified semaphore register. 

* * * * * 